05_7_A_TS_CrossValidation
References
NOTE: the following material is not fully original; it is an extension of reference 1 with comments and other sources to improve the student's understanding:
- Hyndman, R.J., & Athanasopoulos, G., 2021. Forecasting: principles and practice. 3rd edition.
- Fable package documentation
- https://fable.tidyverts.org/index.html
- https://fable.tidyverts.org/articles/fable.html
Packages
library(fpp3)
Time Series Cross-Validation
Introduction to the concept
Evaluating accuracy on a single train-test split yields results that depend heavily on the particular split performed.
To achieve more statistically robust results, we may:
- Compute error metrics on many different train-test splits
- Average the results obtained for each split.
This simple idea is what we refer to as cross-validation.
Cross-validation in time series
Cross-validation applied to time series consists of:
- Generating a series of train-test splits. These splits must be done considering the aspects below:
- These splits must respect the time-order structure of the data. Unlike in other areas of ML, where the train-test splits are done based on different random sampling techniques, here the time-structure of the data must be respected.
- Training sets need a minimum size \(\rightarrow\) earliest observations cannot be used as test sets or the training set would be too small.
- The smallest training dataset should be at least as large as the maximum forecast horizon required. If possible the smallest training dataset should be bigger than this.
- The training set shall only contain observations that occurred prior to the observations of the test set.
- All the test sets must have the same length, and they must be placed in time after the training set.
- Compute the point forecast accuracy metrics for each of the different train-test splits generated and average them.
Again, it is essential to respect the time-structure of the data when performing these splits.
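The split-generation logic above can be sketched in a few lines of base R. This is an illustrative toy, not the fable implementation; the sizes and indices are made up for the example:

```r
# Illustrative toy (base R): expanding-window train-test indices for a series
# of length n, with an initial training size of 5 and a one-step-ahead test set.
n <- 10
init <- 5
splits <- lapply(init:(n - 1), function(k) {
  list(train = 1:k,   # all observations up to time k
       test  = k + 1) # the single observation that follows in time
})
length(splits)        # 5 splits: training sizes 5, 6, 7, 8, 9
splits[[1]]$test      # 6, the first test observation
```

Note how each test index always comes after its training window, so the time order of the data is never violated.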
Visualizing the train-test splits:
Problem: given a time-series, we are to fit multiple models and evaluate their performance using cross-validation. Input: the corresponding values of the time series \(y_t\) at a series of points in time depicted below in grey. For simplicity, I only depict the points in time, without depicting the values \(y_t\):
Instead of splitting this time series in a single train-test split, we are going to generate multiple train-test splits based on this original time series.
Example 1: one-step ahead forecasts
Let us start with a simple example, in which:
- Training sets have an initial size and grow one observation at a time.
- Test sets consist of a single observation and a single forecast horizon h = 1.
The diagram below represents this situation:
- Blue observations represent the training sets
- Orange observations represent the test sets.
Performing cross-validation based on this series of train-test sets means:
- Fit the models to each of the training sets.
- For each model fitted to each training set, evaluate the forecast errors on their corresponding test set.
- Compute the average of the resulting forecast errors over all the train-test splits.
Result: average errors over all the training datasets of each model for the forecast horizon being considered (in the example h = 1)
Example 2: multi-step forecasts
For multi-step forecasts, the same strategy is followed. This time the test set is at a distance h (the forecast horizon) from the train set.
Suppose we are interested in models that produce good 4-step-ahead forecasts. The corresponding train-test splits would be:
Again, performing cross-validation based on this series of train-test sets means:
- Fit the models to each of the training sets.
- For each model fitted to each training set, evaluate the forecast errors on their corresponding test set.
- Compute the average of the resulting forecast errors over all the train-test splits.
Result: average errors over all the training datasets of each model for the forecast horizon being considered (in the example h = 4)
Example 3: cross-validation for multiple forecast horizons
In practice we will be interested in analyzing which models perform best for a series of forecast horizons, not just for a specific forecast horizon. For this purpose we need train-test splits that look like those in the figure below. The difference is that now the test sets include more than one forecast horizon.
- With the splits in the example, we could analyze forecast horizons of up to h=4.
With these train-test splits, cross-validation can be performed at 2 levels:
- Performance of the different models for each forecast horizon separately.
- Result: average performance metrics over all training datasets for each combination of:
- Model.
- Forecast horizon.
- Performance of the different models on average for all forecast horizons.
- Result: average performance metrics over all training datasets and forecast horizons for each model.
Option 1: performance of the different models for each forecast horizon separately
- Fit the models to each of the training sets.
- For each model fitted to each training set, evaluate the forecast errors on their corresponding test set, separately for each forecast horizon.
- For each combination of model and forecast horizon, compute the average of the resulting errors over all the train-test splits.
Result: average errors of each model over all the training datasets for every forecast horizon being considered (in the example for h = 1, 2, 3 and 4).
- That is, we would obtain h results per model (average error for each forecast horizon).
Option 2: performance of the different models averaged over all the forecast horizons
- Fit the models to each of the training sets.
- For each model fitted to each training set, evaluate the forecast errors on their corresponding test set, separately for each forecast horizon.
- For each combination of model and forecast horizon, compute the average of the resulting errors over all the train-test splits.
- For each model compute the average over all forecast horizons.
Result: average errors of each model over all the training datasets and forecast horizons.
- That is, we would obtain 1 result per model (error averaged over all forecast horizons)
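The two aggregation levels can be illustrated with a toy table of absolute errors. The column names and numbers below are made up purely for illustration:

```r
# Toy table of absolute errors per (model, horizon, split); values are made up.
errs <- expand.grid(model = c("A", "B"), h = 1:2, split = 1:2)
errs$ae <- c(1, 2, 3, 4, 2, 2, 4, 4)

# Option 1: average over the splits, separately per (model, h)
# -> one row per combination of model and horizon
aggregate(ae ~ model + h, data = errs, FUN = mean)

# Option 2: average over splits AND horizons
# -> one row per model
aggregate(ae ~ model, data = errs, FUN = mean)
```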
Function stretch_tsibble()
Let us move to the practical aspects of how to obtain these multiple train-test splits in R.
The function stretch_tsibble(), from the tsibble package (loaded along with fpp3), is used to create the many training sets needed in a cross-validation context. Arguments:
stretch_tsibble(.x, .step = 1, .init = 1)
.x: the tsibble to be split following the time series cross-validation strategy.
.step: a positive integer specifying the growth rate of the training sets, i.e. how many observations at a time are added to the training datasets. Default: .step = 1.
.init: a positive integer giving the initial size of the training set. Default: .init = 1.
Remember the indications given above about the minimum sizes of the training datasets.
The figure below summarizes the meaning of these arguments applied to one of the examples given before (you can zoom in on the .html file to enlarge it).
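Before the detailed example, here is a minimal toy illustration of these arguments (the toy data is made up):

```r
library(fpp3)  # stretch_tsibble() comes from the tsibble package, loaded by fpp3

# A made-up tsibble with 4 observations
toy <- tsibble(t = 1:4, y = c(10, 12, 11, 13), index = t)

toy %>% stretch_tsibble(.init = 2, .step = 1)
# Returns the 3 training sets stacked on top of each other, keyed by .id:
#   .id = 1: t = 1, 2        (initial size given by .init = 2)
#   .id = 2: t = 1, 2, 3     (previous set grown by .step = 1)
#   .id = 3: t = 1, 2, 3, 4
```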
Detailed example of time series cross validation
We will use the data for Google stock closing prices in 2015:
google_stock <- gafa_stock %>%
filter(Symbol == "GOOG", year(Date) >= 2015) %>%
mutate(day = row_number()) %>%
update_tsibble(index = day, regular = TRUE)

google_2015 <- google_stock %>% filter(year(Date) == 2015)

Step 1. Create the set of training datasets using stretch_tsibble():
google_2015_tr <- google_2015 %>%
stretch_tsibble(.init = 20, .step = 1) %>%
relocate(Date, Symbol, .id) # reorder column
google_2015_tr

# A tsibble: 31,688 x 10 [1]
# Key: Symbol, .id [233]
Date Symbol .id Open High Low Close Adj_Close Volume day
<date> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 2015-01-02 GOOG 1 526. 528. 521. 522. 522. 1447600 1
2 2015-01-05 GOOG 1 520. 521. 510. 511. 511. 2059800 2
3 2015-01-06 GOOG 1 512. 513. 498. 499. 499. 2899900 3
4 2015-01-07 GOOG 1 504. 504. 497. 498. 498. 2065100 4
5 2015-01-08 GOOG 1 495. 501. 488. 500. 500. 3353600 5
6 2015-01-09 GOOG 1 502. 502. 492. 493. 493. 2069400 6
7 2015-01-12 GOOG 1 492. 493. 485. 490. 490. 2322400 7
8 2015-01-13 GOOG 1 496. 500. 490. 493. 493. 2370500 8
9 2015-01-14 GOOG 1 492. 500. 490. 498. 498. 2235700 9
10 2015-01-15 GOOG 1 503. 503. 495. 499. 499. 2715800 10
# ℹ 31,678 more rows
# Number of training datasets generated
google_2015_tr %>% pull(.id) %>% max()

[1] 233
As you can see, 233 training datasets have been generated:
- The column .id is an identifier for each of the training sets.
- Because .init was set to 20, the first training set has 20 elements.
- Because .step was set to 1, the training set is increased one observation at a time.
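These properties can be verified directly on the generated object; a quick sanity check, assuming the objects created above:

```r
# Quick sanity checks on the training sets generated above
google_2015_tr %>% filter(.id == 1) %>% nrow()  # 20: the .init size
google_2015_tr %>% filter(.id == 2) %>% nrow()  # 21: grown by .step = 1
```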
Step 2: fit the models to be assessed to the training datasets generated before
The following code fits a drift and a mean model to each of the training datasets! (2 x 233 = 466 models!)
google_2015_fit_cv <- google_2015_tr %>%
model(
Drift = RW(Close ~ drift()),
Mean = MEAN(Close)
)
google_2015_fit_cv

# A mable: 233 x 4
# Key: Symbol, .id [233]
Symbol .id Drift Mean
<chr> <int> <model> <model>
1 GOOG 1 <RW w/ drift> <MEAN>
2 GOOG 2 <RW w/ drift> <MEAN>
3 GOOG 3 <RW w/ drift> <MEAN>
4 GOOG 4 <RW w/ drift> <MEAN>
5 GOOG 5 <RW w/ drift> <MEAN>
6 GOOG 6 <RW w/ drift> <MEAN>
7 GOOG 7 <RW w/ drift> <MEAN>
8 GOOG 8 <RW w/ drift> <MEAN>
9 GOOG 9 <RW w/ drift> <MEAN>
10 GOOG 10 <RW w/ drift> <MEAN>
# ℹ 223 more rows
The resulting mable has, as expected:
- 233 rows, because there are a total of 233 training datasets (think of each training dataset as a separate time series).
- The combination of the columns Symbol and .id indicates to which training dataset each model has been fit.
- The columns Drift and Mean (user-chosen names for the models) correspond to the models fitted to each of these training datasets.
Step 3: generate forecasts
Let us generate forecasts with a horizon of up to 8 trading days for each of those models:
# TSCV accuracy
google_2015_fc_cv <- google_2015_fit_cv %>% forecast(h = 8)
google_2015_fc_cv

# A fable: 3,728 x 6 [1]
# Key: Symbol, .id, .model [466]
Symbol .id .model day Close .mean
<chr> <int> <chr> <dbl> <dist> <dbl>
1 GOOG 1 Drift 21 N(532, 102) 532.
2 GOOG 1 Drift 22 N(533, 213) 533.
3 GOOG 1 Drift 23 N(533, 335) 533.
4 GOOG 1 Drift 24 N(534, 467) 534.
5 GOOG 1 Drift 25 N(534, 609) 534.
6 GOOG 1 Drift 26 N(535, 762) 535.
7 GOOG 1 Drift 27 N(535, 924) 535.
8 GOOG 1 Drift 28 N(536, 1097) 536.
9 GOOG 1 Mean 21 N(510, 220) 510.
10 GOOG 1 Mean 22 N(510, 220) 510.
# ℹ 3,718 more rows
The output of the forecast table (fable) is, as expected:
- 8 forecasts for each combination of Symbol, .id and .model. Therefore a total of 8 x 2 x 233 = 3,728 point forecasts (column .mean) and their corresponding forecast distributions (column Close).
Step 4: generate cross-validated accuracy metrics:
4.1 Option 1: generate accuracy metrics for each combination of model and forecast horizon.
To achieve this, we first need to add an additional column to the foregoing forecast table to identify to which horizon each forecast corresponds.
- That is, we need a row counter that resets every time the combination of .id and .model changes.
- Pay attention to the line as_fable(response = "Close", distribution = Close). This line ensures that, after performing group_by() followed by mutate() and ungroup(), the result of the operation is still a fable (forecast table).
- This is very important for the function accuracy() to work properly in the subsequent steps.
google_2015_fc_cv <- google_2015_fc_cv %>%
group_by(.id, .model) %>%
mutate(h = row_number()) %>%
ungroup() %>%
as_fable(response = "Close", distribution = Close) %>%
select(h, everything()) # Reorder columns
google_2015_fc_cv

# A fable: 3,728 x 7 [1]
# Key: Symbol, .id, .model [466]
h Symbol .id .model day Close .mean
<int> <chr> <int> <chr> <dbl> <dist> <dbl>
1 1 GOOG 1 Drift 21 N(532, 102) 532.
2 2 GOOG 1 Drift 22 N(533, 213) 533.
3 3 GOOG 1 Drift 23 N(533, 335) 533.
4 4 GOOG 1 Drift 24 N(534, 467) 534.
5 5 GOOG 1 Drift 25 N(534, 609) 534.
6 6 GOOG 1 Drift 26 N(535, 762) 535.
7 7 GOOG 1 Drift 27 N(535, 924) 535.
8 8 GOOG 1 Drift 28 N(536, 1097) 536.
9 1 GOOG 1 Mean 21 N(510, 220) 510.
10 2 GOOG 1 Mean 22 N(510, 220) 510.
# ℹ 3,718 more rows
As you can see, the column h now identifies to which forecast horizon each forecast corresponds. It resets every time the value of the column .id or .model changes, because in the preceding code we performed group_by(.id, .model).
Once this has been done, we can use the accuracy() function to compute the metrics for each combination of .model and h, giving it an additional argument by that specifies the aggregation level for the accuracy metrics:
summary_cv_h <- google_2015_fc_cv %>%
  accuracy(google_2015, by = c(".model", "h", "Symbol"))

Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
8 observations are missing between 253 and 260
In the preceding code, pay attention to the line accuracy(google_2015, by = c(".model", "h", "Symbol")):
- We gave google_2015 as an argument to accuracy(). This is the original dataset, prior to applying the function stretch_tsibble() to generate the training datasets.
- by = c(".model", "h", "Symbol") specifies the desired aggregation level for the accuracy metrics.
- NOTE: in this example, since we are dealing with a single time series, we could omit the key identifier of this time series (Symbol), because it only takes one value (GOOG). But if you have a dataset that contains multiple time series, you will need to include it.
Let us inspect the result:
summary_cv_h

# A tibble: 16 × 12
.model h Symbol .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Drift 1 GOOG Test 0.430 11.3 7.17 0.0515 1.16 1.01 1.01 0.0859
2 Drift 2 GOOG Test 0.951 16.8 10.6 0.113 1.72 1.49 1.50 0.504
3 Drift 3 GOOG Test 1.51 20.5 13.3 0.181 2.17 1.87 1.83 0.669
4 Drift 4 GOOG Test 2.04 23.4 15.3 0.249 2.47 2.14 2.09 0.747
5 Drift 5 GOOG Test 2.51 25.8 16.8 0.307 2.73 2.36 2.31 0.789
6 Drift 6 GOOG Test 2.98 27.9 18.4 0.367 2.98 2.58 2.49 0.827
7 Drift 7 GOOG Test 3.47 29.9 20.1 0.431 3.25 2.82 2.67 0.835
8 Drift 8 GOOG Test 3.93 31.5 21.1 0.490 3.41 2.96 2.82 0.857
9 Mean 1 GOOG Test 60.5 85.5 62.6 8.87 9.26 8.78 7.64 0.976
10 Mean 2 GOOG Test 61.1 86.1 63.2 8.95 9.35 8.87 7.70 0.976
11 Mean 3 GOOG Test 61.7 86.7 63.8 9.04 9.44 8.95 7.75 0.976
12 Mean 4 GOOG Test 62.3 87.4 64.4 9.13 9.53 9.04 7.81 0.976
13 Mean 5 GOOG Test 62.9 88.0 65.1 9.22 9.63 9.13 7.86 0.976
14 Mean 6 GOOG Test 63.5 88.6 65.6 9.31 9.71 9.21 7.92 0.976
15 Mean 7 GOOG Test 64.1 89.2 66.2 9.40 9.80 9.29 7.97 0.977
16 Mean 8 GOOG Test 64.7 89.8 66.8 9.48 9.88 9.37 8.02 0.977
As you can see, the result details the accuracy metrics (columns ME to ACF1) for each combination of .model and h. That is, for each combination of model and forecast horizon.
The output above contains, as expected:
- 16 rows, one row for each combination of model (.model) and forecast horizon (h).
- One column for each of the accuracy metrics returned by accuracy().
- In this course, we will limit ourselves to the ones explained in class.
Each of the rows above shows the average error metrics over all the training datasets (233 training sets) of each model fitted, separated by forecast horizon. Some examples:
- The row corresponding to h = 1 and model = Drift contains the average error metrics (MAE, RMSE…) of the 233 Drift models that were fitted to their corresponding training datasets when forecasting one step ahead on their corresponding test datasets.
- The row corresponding to h = 4 and model = Drift contains the average error metrics (MAE, RMSE…) of the 233 Drift models that were fitted to their corresponding training datasets when forecasting the value four steps ahead on their corresponding test datasets.
This is a much more statistically robust method than just performing a single train-test split.
A good way to choose the best performing forecasting model: find the model with the smallest RMSE for the forecast horizon we wish to compute, using time series cross-validation. As explained in the session on point forecast accuracy, minimizing the RMSE will lead to forecasts of the mean.
summary_cv_h %>%
ggplot(aes(x = h, y = RMSE, color = .model)) +
  geom_point()

As you can see, the graph indicates that the average performance of the drift model over all the train-test splits is much better (smaller errors) than that of the mean method for this particular time series, for all the forecast horizons analyzed.
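The same comparison can also be made numerically. One possible sketch, using dplyr verbs (loaded with fpp3) to keep, for each horizon, the model with the smallest RMSE:

```r
# For each forecast horizon, keep only the model with the smallest RMSE
summary_cv_h %>%
  group_by(h) %>%
  slice_min(RMSE, n = 1) %>%
  ungroup() %>%
  select(h, .model, RMSE)
```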
4.2 Option 2: generate average accuracy metrics for each model averaged over all the forecasting horizons.
Perhaps we are not interested in producing forecasts with a specific horizon, but rather we would like a model that returns the best forecasts on average over all the horizons.
To attain this, we again use the accuracy() function but do not provide the by argument. It then defaults to grouping solely by .model and the key identifier of the time series. Let us check this:
google_2015_fc_cv %>%
  accuracy(google_2015)

Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
8 observations are missing between 253 and 260
# A tibble: 2 × 11
.model Symbol .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Drift GOOG Test 2.21 24.2 15.3 0.272 2.48 2.15 2.16 0.814
2 Mean GOOG Test 62.6 87.7 64.7 9.17 9.57 9.08 7.83 0.966
This is equivalent to doing:
google_2015_fc_cv %>%
  accuracy(google_2015, by = c(".model", "Symbol"))

Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
8 observations are missing between 253 and 260
# A tibble: 2 × 11
.model Symbol .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Drift GOOG Test 2.21 24.2 15.3 0.272 2.48 2.15 2.16 0.814
2 Mean GOOG Test 62.6 87.7 64.7 9.17 9.57 9.08 7.83 0.966
The above metrics are the average performance metrics of each of the model types over all the train-test splits and forecast horizons. For example:
- The row corresponding to Drift contains the error metrics of the drift model averaged over the 233 train-test splits and over all the forecast horizons.
Exercise cross-validation
Task
Perform cross-validation on the Australian beer production dataset. Use the dataset recent_production created during this lesson.
Step 0. Identify your dataset and think about which variables you need
- We will work with recent_production, specifically with the production of beer, which we will store in beer_production. We will apply stretch_tsibble() to this time series to generate our set of training datasets.
- For convenience, I will select only the variables relevant to the analysis.
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)

beer_production <- select(recent_production, Quarter, Beer)
beer_production

# A tsibble: 74 x 2 [1Q]
Quarter Beer
<qtr> <dbl>
1 1992 Q1 443
2 1992 Q2 410
3 1992 Q3 420
4 1992 Q4 532
5 1993 Q1 433
6 1993 Q2 421
7 1993 Q3 410
8 1993 Q4 512
9 1994 Q1 449
10 1994 Q2 381
# ℹ 64 more rows
Step 1. Create the set of training datasets using stretch_tsibble()
Step 2: Fit the models
Step 3: generate the forecasts
Consider a forecast horizon of up to 4.
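As a starting point, the three steps can mirror the Google example above. The sketch below is one possible solution outline; the initial training size and the model choices (SNAIVE and MEAN) are my own assumptions, not prescribed by the task:

```r
# Step 1: training sets (quarterly data; .init = 12 gives 3 full years)
beer_tr <- beer_production %>%
  stretch_tsibble(.init = 12, .step = 1)

# Step 2: fit candidate models to every training set
beer_fit <- beer_tr %>%
  model(
    SNaive = SNAIVE(Beer),
    Mean   = MEAN(Beer)
  )

# Step 3: forecasts of up to 4 quarters ahead
beer_fc <- beer_fit %>% forecast(h = 4)
```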